Simple Regret Optimization in Online Planning for Markov Decision Processes
Authors
Zohar Feldman, Carmel Domshlak
Abstract
We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. The performance of algorithms for online planning is assessed in terms of simple regret…
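For orientation, simple regret can be formalized as follows (a standard definition in this literature; the notation $s_0$, $V^{*}$, $Q^{*}$ is ours, not quoted from the paper): with current state $s_0$, optimal value $V^{*}(s_0)$, and optimal action value $Q^{*}(s_0, a)$, the simple regret of recommending action $a$ is

\[
\mathrm{SR}(a) = V^{*}(s_0) - Q^{*}(s_0, a),
\]

i.e., the expected performance loss from executing $a$ and acting optimally thereafter, rather than acting optimally from $s_0$.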
Similar resources

Online Regret Bounds for Markov Decision Processes with Deterministic Transitions
We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds…
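For reference, the online regret here is cumulative (a sketch under standard average-reward notation, which is our assumption rather than a quotation): if $\rho^{*}$ denotes the average reward of an (ε-)optimal policy and $r_t$ the reward collected at step $t$, then after $T$ steps

\[
R_T = T\rho^{*} - \sum_{t=1}^{T} r_t,
\]

and the bounds referred to above grow as $O(\log T)$.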
Regret Minimization in Nonstationary Markov Decision Processes
We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion to some extent. We propose online learning algorithms and provide guarantees on their performance evaluated in retrospect against stationary policies. Unlike previous works, the guarantees depend critically on the variability…
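To make the evaluation criterion concrete (illustrative notation of ours, not the paper's): with time-varying rewards $r_t$ and transitions $p_t$, the learner is compared in retrospect against the best stationary policy,

\[
R_T = \max_{\pi \,\text{stationary}} \mathbb{E}\Big[\sum_{t=1}^{T} r_t\big(s_t^{\pi}, \pi(s_t^{\pi})\big)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big],
\]

where $s_t^{\pi}$ is the state sequence induced by following $\pi$ in the same changing environment.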
Online Markov Decision Processes
We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have…
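Concretely (again in our own notation): the dynamics $P$ are fixed, an adversary picks reward functions $r_1, r_2, \dots$, and the benchmark is the best single stationary policy in hindsight, so the regret takes the form

\[
R_T = \max_{\pi} \mathbb{E}_{P,\pi}\Big[\sum_{t=1}^{T} r_t\big(s_t, \pi(s_t)\big)\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_t(s_t, a_t)\Big],
\]

which parallels the experts setting, with stationary policies as the comparator class rather than single actions.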
Regret-based Reward Elicitation for Markov Decision Processes
The specification of a Markov decision process (MDP) can be difficult. Reward function specification is especially problematic; in practice, it is often cognitively complex and time-consuming for users to precisely specify rewards. This work casts the problem of specifying rewards as one of preference elicitation and aims to minimize the degree of precision with which a reward function must be specified…
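A common way to cast this (standard minimax-regret notation; an assumption on our part, not a quotation from the paper): given a set $\mathcal{R}$ of reward functions consistent with the user's responses so far, the max regret of a policy $\pi$ and the minimax-optimal policy are

\[
\mathrm{MR}(\pi, \mathcal{R}) = \max_{r \in \mathcal{R}} \Big( \max_{\pi'} V_{r}^{\pi'} - V_{r}^{\pi} \Big), \qquad \pi^{\dagger} \in \arg\min_{\pi} \mathrm{MR}(\pi, \mathcal{R}),
\]

and elicitation queries are chosen to shrink $\mathcal{R}$ until the minimax regret is acceptably small.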
Journal
Journal title: Journal of Artificial Intelligence Research
Year: 2014
ISSN: 1076-9757
DOI: 10.1613/jair.4432